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00 

Sparse signal reconstruction algorithms have attracted research attention due to their wide applications 
in various fields. In this paper, we present a simple Bayesian approach that utilizes the sparsity constraint 
and a priori statistical information (Gaussian or otherwise) to obtain near optimal estimates. In addition, 
we make use of the rich structure of the sensing matrix encountered in many signal processing applications 

>. 

t^. to develop a fast sparse recovery algorithm. The computational complexity of the proposed algorithm 

is relatively low compared with the widely used convex relaxation methods as well as greedy matching 





pursuit techniques, especially at a low sparsity ratel 



I. Introduction 



Compressive Sensing/Compressed Sampling (CS) is a fairly new field of research that is finding many 
applications in statistics and signal processing |1]. As its name suggests, CS attempts to acquire a signal 
(inherently sparse in some subspace) at a compressed rate by randomly projecting it onto a subspace 
that is much smaller than the dimension of the signal itself. Provided that the sensing matrix satisfies 
a few conditions, the sparsity pattern of such a signal can be recovered non-combinatorially with high 
probability. This is in direct contrast to the traditional approach of sampling signals according to the 
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Nyquist theorem and then discarding the insignificant samples. Generally, most naturally occurring signals 
are sparse in some basis/domain and CS can therefore be utilized for their reconstruction. CS has been 
used successfully in, for example (but not limited to), peak-to-average power ratio reduction in orthogonal 
frequency division multiplexing (OFDM) [2], image processing (one-pixel camera (H), impulse noise 
estimation and cancellation in power-line communication and digital subscriber lines (DSL) [5 ], magnetic 
resonance imaging (MRI) [6], channel estimation in communications systems [7], ultra-wideband (UWB) 
channel estimation [8 ], direction-of-arrival (DOA) estimation (9), and radar design ifTOl , to name a few. 

The CS problem can be set up as follows. Let x G be a P-sparse signal (i.e., a signal that consists 
of P non-zero coefficients in an ^-dimensional space with P « N) in some domain and let y G C M 
be the observation vector with M « N given by 

y = *x + n (1) 

where f is an M x measurement/sensing matrix that is assumed to be incoherent with the domain in 
which x is sparse and n is complex additive white Gaussian noise, CA/"(0, cr^Ly/). As M « N, this is 
an ill-posed problem as there is an infinite number of solutions for x satisfying dTJ. Now if it is known a 
priori that x is sparse, the theoretical way to reconstruct the signal is to solve an £o-norm minimization 
problem using only M = 2P measurements when the signal and measurements are free of noise ifTTTl 

x = min ||x||o subject to y = \I/x. (2) 

X 

Unfortunately, solving the ^cr norm minimization problem is NP-hard HTT1 lTT2l and is therefore not 
practical. Thus, different sub-optimal approaches, categorized as compressive sensing, have been presented 
in the literature to solve this problem. In [12] and |[T3l . it has been shown that x can be reconstructed with 
high probability in polynomial time by using convex relaxation approaches at the cost of an increase in 
the required number of measurements. This is done by solving a relaxed £i-norm minimization problem 
using linear programming instead of £o- norm minimization fl2l . |[T3l 

x = min ||x||i subject to ||y — \l/x||2 < e (3) 

X 

where e = J o\{M + yj2M). For ^i-norm minimization to reconstruct the sparse signal accurately, the 
sensing matrix, should be sufficiently incoherent. In other words, the coherence, defined as n(*f?) = 
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maxj^j K^j^ -)|, should be as small as possible (with = 1 depicting the worst case) lfT2l . In 

|[T4l . it has been shown that these convex relaxation approaches have a Bayesian rendition and may be 
viewed as maximizing the maximum a posteriori estimate of x, given that x has a Laplacian distribution. 
Although convex relaxation approaches are able to recover sparse signals by solving under-determined 
systems of equations, they also suffer from a number of drawbacks (some of which are common to other 
sparse recovery algorithms including lTT6l - |[T9l ) that we discuss below. 

A. Drawbacks of Convex Relaxation Approaches 

1 ) Complexity: Convex relaxation relies on linear programming to solve the convex ^i-norm minimiza- 
tion problem, which is computationally relatively complex (its complexity is of the order 0(M 2 iV 3 / 2 ) 
when interior point methods are used ll37l0 . This approach can therefore not be used in problems with very 
large dimensions. To overcome this drawback, many greedy algorithms have been proposed that recover 
the sparse signal iteratively. These include Orthogonal Matching Pursuit (OMP) lfT51 . Ifl6l . Regularized 
Orthogonal Matching Pursuit (ROMP) H3, Stagewise Orthogonal Matching Pursuit (StOMP) [18], and 
Compressive Sampling Matching Pursuit (CoSamp) [19]. These greedy approaches are relatively faster 
than their convex relaxation counterparts (approximately O(MNR) where R is the number of iterations). 

2) The need for randomness in the sensing matrix: Convex relaxation methods cannot make use of 
the structure exhibited by the sensing matrix (e.g., a structure that comes from a Toeplitz sensing matrix 
or that of a partial discrete Fourier transform (DFT) matrix). In fact, if anything, this structure is harmful 
to these methods as the best results are obtained when the sensing matrix is close to random. This comes 
in contrast to current digital signal processing architectures that only deal with uniform sampling. We 
would thus like to employ more feasible and standard sub-sampling approaches. 

3 ) Inability to harness a priori statistical information: Convex relaxation methods are not able to take 
account of any a priori statistical information (apart from sparsity information) about the signal support 
and additive noise. Any a priori statistical information can be used on the result obtained from the convex 
relaxation method to refine both the signal support obtained and the resulting estimate through a hypothesis 
testing approach [20]. However, this is only useful if these approaches are indeed able to recover the 
signal support. In other words, performance is bottle-necked by the support recovering capability of these 
approaches. We note here that the use of a priori statistical information for sparse signal recovery has 
been studied in a Bayesian context in |[T4l and in algorithms based on belief propagation Ell . l22l . Both 
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ll23l and |[24l use a priori statistical information (assuming x to be mixed Bernoulli-Gaussian); only |[23l 
uses this information in a recursive manner to obtain a fast sparse signal recovery algorithm. However, 
it is not clear how these approaches can be extended to the non-Gaussian case. 

4 ) Evaluating performance in statistically familiar terms: It is difficult to quantify the performance 
of convex relaxation estimates analytically in terms of the mean squared error (MSE) or bias or to relate 
these estimates to those obtained through more conventional approaches, e.g., maximum a posteriori 
probability (MAP), minimum mean-square error (MMSE), or maximum likelihood (ML)]^ 

5) Trading performance for computational complexity: In general, convex relaxation approaches do not 
exhibit the customary tradeoff between increased computational complexity and improved recovery as is 
the case for, say, iterative decoding or joint channel and data detection. Rather, they solve some i\ problem 
using (second-order cone programming) with a set complexity. A number of works have attempted to 
derive sharp thresholds for support recovery |[25l . ||26*1 . In other words, the only degree of freedom 
available for the designer to improve performance is to increase the number of measurements. Several 
iterative implementations 11271 . ll28ll of convex relaxation approaches provide some sort of flexibility by 
trading performance for complexity. 

B. Motivation and Paper Organization 

In this paper, we present a Bayesian approach to sparse signal recovery that has low complexity and 
makes a collective use of 1) a priori statistical properties of the signal and noise, 2) sparsity information, 
and 3) the rich structure of the sensing matrix, Although there have been some works that use the 
structure of the sensing matrix (e.g., ||29l ), it has not yet been rigorously exploited to aid in algorithm 
development and complexity reduction. We also show how our approach is able to deal with both Gaussian 
and non-Gaussian (or unknown) priors, and how we can compute performance measures of our estimates. 
In essence, we demonstrate how our technique enables us to tackle all the drawbacks of convex relaxation 
approaches. 

This remainder of this paper is organized as follows. We start by describing the signal model in the 
next section. In Section JIIJ we derive the MMSE/MAP estimates and introduce the various terms that 
need to be evaluated. In Section [TV] we demonstrate how the structure of the sensing matrix can be 

2 It is worth noting that convex relaxation approaches have their merit in that they are agnostic to the signal distribution and 
thus can be quite useful when worst-case analysis is desired as opposed to average-case analysis. 
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used to recover the sparse signal in a divide-and-conquer manner. Section [V] details the proposed sparse 
reconstruction algorithm that we call Orthogonal Clustering. Section [VI] presents the different structural 
properties of the sensing matrix that are exploited by the proposed algorithm to reduce the computational 
complexity. The performance of the proposed algorithm is compared with various sparse reconstruction 
algorithms presented in the literature by numerical simulations in Section IVIII which is followed by our 
conclusions in Section IVIII I 

C. Notation 

We denote scalars with lower-case letters (e.g., x), vectors with lower-case bold-faced letters (e.g., x), 
matrices with upper-case, bold-faced letters (e.g., X), and sets with script notation (e.g. S). We use Xj to 
denote the i th column of matrix X, x(j) to denote the j th entry of vector x, and <Sj to denote a subset 
of a set S. We also use X5 to denote the sub-matrix formed by the columns {xj : % € S}, indexed by 
the set S. Finally, we use x, x*, x T , and x H to respectively denote the estimate, conjugate, transpose, 
and conjugate transpose of a vector x. 

II. Signal Model 

We adopt the signal model in (Q]). Here, the vector x is modelled as x = xg x^, where denotes 
the Hadamard (element-by-element) multiplication. The entries of x^ are independent and identically 
distributed (i.i.d) Bernoulli random variables and the entries of xg are drawn identically and independently 
from some zero mean distribution^ In other words, we assume that x_e(i)s are Bernoulli with success 
probability p and similarly that the xc(i)s are i.i.d variables with marginal probability distribution function 
f(x). The noise n is assumed to be complex circularly symmetric Gaussian, i.e., n ~ CJ\f(Q, er^Ijw). 
When the support set S of x is known, we can equivalently write £[]) as 

y = ty s *s + n. (4) 

III. Optimum Estimation of x 

Our task is to obtain the optimum estimate of x given the observation y. We can pursue either an 
MMSE or a MAP approach to achieve this goal. In the following, we elaborate on how we can obtain 

3 Most of the results presented in this paper also apply to the case when the entries are independent but not necessarily 
identically distributed. 
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these two estimates. 



A. MMSE Estimation of x 

The MMSE estimate of x given the observation y can be expressed as 

xmmse = E[x|y] = ^p(5|y)E[x|y,5] (5) 
s 

where the sum is over all the possible support sets S of x. The likelihood and expectation involved in 
(f5]) are evaluated below. 

1) Evaluation of E[x|y, S]: Recall that the relationship between y and x is linear (see CD). Thus, 
in the case when x conditioned on its support is Gaussian, E[x|y,<S] is nothing but the linear MMSE 
estimate of x given y (and S), i.e., 

E[x 5 |y]=E[x|y,5]=a^ s -i y (6) 

where 



^E[yy H |5] = I M + 4*5*5- 



(7) 

When x|5 is non-Gaussian or when its statistics are unknown, the expectation E[x|y, «S] is difficult or 
even impossible to calculate. Thus, we replace it by the best linear unbiased estimate (BLUE), i.e., 

E[x 5 |y] = (*g* 5 ) _1 *Sy- (8) 
2) Evaluation of p(S\y): Using Bayes' rule, we can rewrite p(S\y) as 

As the denominator ^5 p(y|5)p(5) is common to all posterior likelihoods, p(S\y), it is a normalizing 
constant that can be ignored. To evaluate p(S), note that the elements of x are active according to a 
Bernoulli process with success probability p. Thus, p(S) is given by 

p(S) =pW(l-p) N ~W. (10) 

It remains to evaluate p(y|<S). Here, we distinguish between the cases of whether or not x\S is Gaussian. 
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1. x|<S is Gaussian: When x|<S is Gaussian, y is Gaussian too with zero mean and covariance 



and we can write the likelihood function at 

exp(-^||y|||j- t 
^ = de,(S,) ^ <»> 

up to an irrelevant constant multiplicative factor, (-^). 

2. x|«S is non-Gaussian or unknown: Alternatively, we can treat x as a random vector of unknown 
(non-Gaussian) distribution, with support S. Therefore, given the support S, all we can say about y is 
that it is formed by a vector in the subspace spanned by the columns of Vl/^, plus a white Gaussian noise 
vector, n. It is difficult to quantify the distribution of y even if we know the distribution of (the non- 
Gaussian) x. One way around this is to annihilate the non-Gaussian component and retain the Gaussian 
one. We do so by projecting y onto the orthogonal complement of the span of the columns of i.e., 
multiplying y by P^ = I — (^^^5) 1 This leaves us with P^y = P^n, which is zero mean 
and with covariance P^cr^P^" = cr^P^. Thus, the conditional density of y given S is approximately 
given by 



p{y\S) ~ exp ( — \ 

CT in 
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(12) 



B. MAP Estimation of x 

To obtain the MAP estimate of x, we first determine the MAP estimate of S, which is given by 

5map = argmaxp(y|S)p(S). (13) 

The prior likelihood p(y\S), is given by (ITTb when x|<S is Gaussian and by (TT2l when x|<S is non- 
Gaussian or unknown, whereas p(S) is evaluated using ( fTQb . The maximization is performed over all 
possible 2 N support sets. The corresponding MAP estimate of x is given by 

xmap = E[x|y,<S M Ap]. (14) 

One can easily see that the MAP estimate is a special case of the MMSE estimate in which the sum 
© is reduced to one term. As a result, we confine the discussion in the rest of the paper to MMSE 
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estimation. 

C. Evaluation over S 

Having evaluated the posterior probability and expectation, it remains to evaluate this over 2 N possible 
supports (see ((5]) and (fT3T >) which is a computationally daunting task. This is compounded by the fact 
that the calculations required for each support set in S are relatively expensive, requiring some form of 
matrix multiplication/inversion as can be seen from (l6l)- (fT2T ). One way around this exhaustive approach 
is somehow to guess at a superset S r consisting of the most probable support and limit the sum in © to 
the superset S r and its subsets, reducing the evaluation space to 2^ Sr ^ points. There are two techniques 
that help us guess at such a set S r . 

1. Convex Relaxation: Starting from £[)), we can use the standard convex relaxation tools fP2l . Ifl3l 
to find the most probable support set, S r , of the sparse vector x. This is done by solving (f3]) and retaining 
some largest P non-zero values where P is selected such that P(||<S|| > P) is very smalljfl 

2. Fast Bayesian Matching Pursuit (FBMP): A fast Bayesian recursive algorithm is presented in ll23l 
that determines the dominant support and the corresponding MMSE estimate of the sparse vector]^ It uses 
a greedy tree search over all combinations in pursuit of the dominant supports. The algorithm starts with 
zero active element support set. At each step, an active element is added that maximizes the Gaussian 
log-likelihood function similar to (fTTT) . This procedure is repeated until we reach P active elements in 
a branch. The procedure creates D such branches, which represent a tradeoff between performance and 
complexity!] 

The discussion in this section applies irrespective of the type of the sensing matrix, However, in 
many applications in signal processing and communications, the sensing matrix is highly structured. This 
fact, which has been largely overlooked in the CS literature, is utilized in the following to evaluate the 
MMSE (MAP) estimate at a much lower complexity than is currently available. 

5 As ||<S|| is a binomial distribution ~ B(N,p), it can be approximated by a Gaussian distribution ~ Af(Np, Np(l — p)), 



when Np > 5 (the DeMoivre-Laplace approximation 1301 ). In this case, P(||«S|| > P) — |erfc ^— ^=====J . 
6 FBMP applies to the Bernoulli Gaussian case only. 

7 Though other greedy algorithms |16|-|19| can also be used, we focus here on FBMP as it utilizes a priori statistical 
information along with sparsity information. 
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IV. A Structure-Based Bayesian Recovery Approach 

Whereas in most CS literature, the sensing matrix, SI/, is assumed to be drawn from a random 
constellation lfl2ll . lfl"3l . in many signal processing and communications applications, this matrix is 
highly structured. Thus, VE' could be a partial DFT matrix [5] or a Toeplitz matrix (encountered in 
many convolution applications Q). Table U lists various possibilities of structured 



TABLE I: Applications involving structured sensing matrices 



Matrix * 


Application 


Partial DFT 


OFDM applications including peak-to-average power ratio 
reduction |2|, narrow-band interference cancelation (5), 
and impulsive noise estimation and mitigation in DSL (5) 


Toeplitz 


Channel estimation |7 |, UWB (8), and DOA estimation (9) 


Hankel 


Wide-band spectrum sensing (311 


DCT 


Image compression [ 32 1 


Structured Binary 


Multi-user detection and contention resolution [33], [34] and 
feedback reduction [35 |, |36| 



Since Vl/ is a fat matrix (M << N), its columns are not orthogonal (in fact not even linearly 
independent). However, in the aforementioned applications, one can usually find an orthogonal subset of 
the columns of Vl/ that span the column space of Vl/. We can collect these columns into a square matrix, 
*&M- The remaining N — M columns of Vl/ group around these orthogonal columns to form semi- 
orthogonal clusters. In general, the columns of Vl/ can be rearranged such that the farther two columns 
are from each other, the lower their correlation is. In this section, we demonstrate how semi-orthogonality 
helps to evaluate the MMSE estimate in a divide-and-conquer manner. Before we do so, we present two 
sensing matrices that exhibit semi-orthogonality. 

A. Examples of Sensing Matrices with Semi-Orthogonality 

1) DFT Matrices: We focus here on the case when the sensing matrix is a partial DFT matrix, 
i.e., * = SFat, where Fjy denotes the N x N unitary DFT matrix, [Fjv]^ = -j= e - j2nab / N with 
a, b G {0, 1, . . . , N — 1} and S is an M x N selection matrix consisting of zeros with exactly one entry 
equal to 1 per row. To enforce the desired semi-orthogonal structure, the matrix S usually takes the form 
S = [OmxZ ImxM Omx(n-z-M)], for some integer Z. In other words, the sensing matrix consists 
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of a continuous band of sensing frequencies. This is not unusual since in many OFDM problems, the 
band of interest (or the one free of transmission) is continuous. In this case, the correlation between two 
columns can be shown to be 



sm(-w{k-k')M/N) 
M sin(7r (k-k')/N) 



(k = k') 
(k ? k') 



(15) 



which is a function of the difference, (k — k') mod N. It thus suffices to consider the correlation of one 
column with the remaining ones. Figure Q] illustrates this correlation for N = 1024 and M = 256. It is 
worth noting that the matrix * exhibits other structural properties (e.g., the fact that it is a Vandermonde 
matrix), which helps us reduce the complexity of the MMSE estimation (see Section [VI] for further 
details). 

2) Toeplitz/Hankel Matrices: We focus here on the Toeplitz case. The discussion can be easily extended 
to the Hankel case. A sub-sampled convolutional linear system can be written in the following matrix 
form, y = \I/x + n, where y is a vector of length M, x is a vector of length N and \I/ is the M x N 
block Toeplitz/diagonal matrix 



O 
O 

O O 



O 
O 





where the size of depends on the sub-sampling ratio. Here, ip^%jj k , = for \k — k'\ > L, and thus the 
columns of \I/ can easily be grouped into truly orthogonal clusters. Note also that the individual columns 
of are related to each other by a shift property, which we explore for further reduction in complexity 
in Section ED 



B. Using Orthogonality for MMSE Estimation 

Let S be a possible support of x. The columns of Vl/^ in (|4]) can be grouped into a maximum of C 
semi-orthogonal clusters, i.e., = \^S t ' ' ' *5o]» wnere is the support set corresponding to 
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the i th cluster (with i = 1, 2, • • • C)U Based on this fact, (@]) can be written as 



y = [* 5l *s 2 * 5c ] 



xi 

X 2 
XC 



+ n. 



(16) 



Columns indexed by these sets should be semi-orthogonal, i.e., ^^.^5 ~ 0; otherwise, Si and Sj are 
merged into a bigger superset. Now, the MMSE estimate of x simplifies td_ 



XMMSE = £ p(Z\y)E[x\y,Z]. 



(17) 



In the following, we show that xmmse can be evaluated in a divide-and-conquer manner by treating each 
cluster independently. To do so, we present in the following how orthogonality manifests itself in the 
calculation of the expectation and likelihood. 

1 ) The effect of orthogonality on the likelihood calculation: Recall that up to a constant factor, the 
likelihood can be written as p{Z\y) = p{y\Z)p{Z). Now, 

P (z) = pQJzd 

= plU^I(i_ p )JV-|U^| 

= p \Z l \+\Z a \+ - +\Za\n _p)^-(l^i|+[^l+-"+|Zo|) 

= p(Z 1 )p(Z 2 )---p(Zc) (18) 

where the equality in (TT~8T > is true up to some constant factor. Now, to evaluate p(y\Z), we distinguish 
between the Gaussian and non-Gaussian cases. For brevity, we focus here on the Gaussian case and 



8 Here, we denote the maximum number of clusters formed by C to distinguish it from P, that refers to the estimate of the 
number of active supports as in [23 1 (see footnote 5). In our approach, C is random and depends on a threshold. This threshold 
is obtained using the a priori statistical information of the noise signal, n. The procedure of forming semi-orthogonal clusters 
is presented in Section fV 7 ! 

9 In writing an expression like the one in J17t . it is understood that estimates of elements of x that do not belong to (J Si are 
identically zero. 
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extrapolate the results to the non-Gaussian case. Recall that 



1 IU,I|2 

p w = de,; Sz) s ' ' (19) 



with £z = I M + ^f*2*S- Here > = l&Zi ®z>] , where ^! z > = [*2 2 *£ 3 • • • *2 C ] • Using 



the matrix inversion lemma, we can write E^ 1 as 

si 1 = (i M +4*^r 1 = (iM+ !1 f*2 1 *s 1 + 11 f*^*S' 

2 2 

= ^-S^^'^' + S*^^ 1 *^)" 1 *^^; (20) 

where = 1^ + ^-*^*^. As and are almost orthogonal (i.e., SE^^Z' = — °), 

(l20l becomes 

(2 2 \ / 2 2 

~ -Im+(im + ^*^ 1 *S 1 ) +(lM + ^z-*%\ ■ 
Continuing in the same manner, it is easy to show that 

C / 2 \ ~ 1 

Ei 1 ~-(C-l)I M + ^(lM + ^|*^*S i ) ■ (22) 

i=l V °n / 

As such, we can write 

1 „ „o \ fC-\„ ,.,\ T^r ( 1 



0" exp (^r l|y||2 )n exp (-^i y i"k;) (23) 

f7 2 H 

where S^. = Im + * ] z^zc Using a similar procedure, we can decompose det(X!.z) as 

2 2 

det(S^) = det(I W + ^1*^*2, + ^f*^*S0 

2 2 

= det(I M + ^f* Zl *H )det(I Af + ^f*^^!*2') ( 24 ) 

2 2 

~ det(I M + %*2 1 *2 1 )det(I M + ^z^%) (25) 



det(E 2l )det(S^) (26) 
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where in going from (l24l ) to (|25T ). we used the fact that z x and VE" ' z> are almost orthogonal. Continuing 
in the same way, we can show that 



c 

det(5^)~ JJdet(E^). (27) 

i=l 

Combining (l23l and (|27T ). we obtain (up to an irrelevant multiplicative factor) 



C 

P (y\z) ~ jjf(yl^)- (28) 

i=l 

Orthogonality allows us to reach the same conclusion (|28T ) for the non-Gaussian case. Now, combining 
( fT8l ) and d28t , we can finally write 

c 

P (z\ y ) ~ jj^iy) (29) 

i=l 



which applies equally to the Gaussian and non-Gaussian cases. 

2 ) The effect of orthogonality on the expectation calculation: In evaluating the expectation, we again 
distinguish between the Gaussian and non-Gaussian cases. We focus here on the non-Gaussian case 
for which E[x^|y] = (^f^^fz) _1 ^^y- Using the decomposition into semi-orthogonal clusters \1/ z = 
[*f?Zi * z 2 ' ' ' *2c]> we can write 



l r 



*z*z 1 *z*z 2 



*"*z, 



■c 




(*3*s) -1 *Sy 




c 





E[x^|y] 



i.e., E[x 2 |y] 



(30) 



E[x 2c |y] 



Orthogonality allows us to write an identical expression to (l30l in the Gaussian case. 

3) The effect of orthogonality on the MMSE estimation: We are now ready to show how (semi)orthogonality 
helps with the MMSE evaluation. To do this, we substitute the decomposed expressions (l29l) and (l30l) 
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into (Q/7]) to get 



xmmse = £ p( 2: ly) E [ x ly' 2: ] 



£ Il^ly) 

ZiCSi, i=l,...,C i 



E[x|y,Zi] 
E[x|y,£ 2 ] 

E[x|y,Z c ] 



E 22 c5 2 p(^|y)E[x| y) z 2 ] 



(31) 



T,z c csc p( z c\yM*\y, z c ] 

where the last line follows from the fact that Ez PO^i |y) = 1- Thus, the semi-orthogonality of the 
columns in the sensing matrix allows us to obtain the MMSE estimate of x in a divide-and-conquer 
manner by estimating the non-overlapping sections of x independently from each other. Other structural 
properties of * can be utilized to reduce further the complexity of the MMSE estimation. For example, 
the orthogonal clusters exhibit some form of similarity and the columns within a particular cluster are 
also related to each other. We explore these properties for complexity reduction in Section [Vl] However, 
before doing so, we devote the following section to a full description of our Bayesian orthogonal clustering 
algorithm. 

V. An Orthogonal Clustering (OC) Algorithm for Sparse Reconstruction 

In this section, we present our sparse reconstruction algorithm, which is based on orthogonal clustering. 
The main steps of the algorithm are detailed in the following and summarized in Figure [2] 



A. Determine dominant positions 

Consider the model given in ([TJ) reproduced here for convenience, y = \l/x + n. By correlating the 
observation vector, y, with the columns of the sensing matrix, and by retaining correlations that 
exceed a certain threshold, we can determine the dominant positions/regions where the support of the 
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sparse vector, x, is located. The performance of our orthogonal clustering algorithm is dependent on this 



initial correlation-based guessl 



B. Form semi-orthogonal clusters 

Define a threshold k such that p(n > k) = p n is very smalll^l The previous correlation step creates a 
vector of N correlations. From these correlations, obtain the indices with the correlation greater than the 
threshold, k. Let ijdenote the index with the largest correlation above k and form a cluster of size L 
centered around ijo Now, let i% denote the corresponding index of the second largest correlation above 
k and form another cluster of size L around i^. If the two clusters thus formed are overlapping, merge 
them into one big cluster. Continue this procedure until all the correlations greater than k are exhausted. 



C. Find the dominant supports and their likelihoods 

Let Li be the length of cluster i and let P c denote the maximum possible support size in a cluster!^ 
Let C be the total number of semi-orthogonal clusters formed in the previous step. For each of them, find 
the most probable support of size, \S\ = 1, |<S| = 2, • • • , |5| = P c , by calculating the likelihoods for all 
supports of size |5| (using either (fTTT) or (fT2l ). Each cluster is processed independently by capitalizing 
on the semi-orthogonality between the clusters. The expected value of the sparse vector x given y and 
the most probable support for each size can also be evaluated using either © or © depending on the a 
priori statistical information. 

D. Evaluate the estimate of x 

Once we have the dominant supports for each cluster, their likelihoods, the expected value of x given 
y and the dominant supports, the MMSE (or MAP) estimates of x can be evaluated as discussed in 

10 We can also apply a convex relaxation approach, retain the P largest values, and form clusters around them. This allows 
us to incorporate a priori statistical information and obtain MMSE estimates but the algorithm in this case is bottle-necked by 
the performance of the convex relaxation approach and also loses the appeal of low complexity. 

11 As n ~ 7V(0, a^), the threshold can be easily evaluated as, k = v / 2o^erfc _1 (2p n ). 

12 Given a fat sensing matrix, we consider two columns to be orthogonal (or semi orthogonal) when their correlation is below 
some value, e. The cluster size L is thus the minimum separation between two columns that makes these two columns semi- 
orthogonal. Obviously, the distance ,L, is a function of the correlation tolerance, e. The lower the tolerance, e, the larger the 
cluster size, L. 

13 P c is calculated in a way similar to P as the support in a cluster is also a Binomial distribution ~ B(Li,p). Thus, we set 
P c = [erfc- 1 (10 _2 ) v /2L 4 p(l - p) + L iP ] (see footnote 5). 
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Section JV] (see (I3TT)). Note that these estimates are approximate as they are evaluated using only the 
dominant supports instead of using all supports. 

VI. Reducing the Computational Complexity 

In this paper, we explore three structures of the sensing matrix that help us to reduce the complexity 
of MMSE estimation. 

1) Orthogonality (independence) of clusters: In Section |lVl the orthogonality of clusters allowed us 
to calculate the MMSE estimate independently over clusters in a divide-and-conquer manner. 

2) Similarity of clusters: While the columns of the clusters are (semi)orthogonal, allowing us to treat 
them independently, these columns could exhibit some form of similarity making some MMSE 
calculations invariant over these clusters. For example, the columns of a DFT matrix can be 
obtained from each other through a modulation operation while those of the Toeplitz matrix can 
be obtained through a shift operation. The correlation calculations that repeatedly appear in the 
MMSE estimation are invariant to the modulation and shift operations. 

3) Order within a cluster: MMSE estimation in a cluster involves calculating the likelihoods and 
expectations for all supports of size i = 1, 2, • • ■ , P c . Several quantities involved in these evaluations 
can be obtained in an order-recursive manner, incrementally moving from calculations for supports 
of size i to similar calculations for supports of size i + 1. 

We explore the last two properties in the following subsections. 

A. Similarity of Clusters 

As evident from the previous sections, calculating the likelihood can be done in a divide-and-conquer 
manner by calculating the likelihood for each cluster independently. This is a direct consequence of the 
semi-orthogonality structure of the columns of the sensing matrix. Moreover, due to the rich structure of 
the sensing matrix, the clusters formed are quite similar. In the following subsections, we use the structure 
present in DFT and Toeplitz sensing matrices to show that the likelihood and expectation expressions 
in each cluster (for both the Gaussian and non-Gaussian cases) are strongly related, allowing many 
calculations across clusters to be shared. 
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1) Discrete Fourier Transform (DFT) Matrices: Let tp 1 , ip 2 i i V'i denote the sensing columns 
associated with the first cluster. Then, it is easy to see that the corresponding columns for the i th cluster 
of equal length that are Aj columns away are, ip x tp A ., ., • • • , ., where is 



some constant vector that depends on the sensing columns^ Assume that we evaluate the likelihood, 
p(Z\\y), and expectation, E[x|y, Z{\, for a set of columns, Z\, in the first cluster. For this set, we make 
the assumption that 

y = * 2l x + n. (32) 

Now, let Zi denote the same set of columns chosen from the i th cluster that is Aj columns away (in 
other words Zi = Z\ + Aj). For this set, we assume that 

y = + n. (33) 

Now (Hadamard) multiply both sides of the above equation by to get 

y04 = * Zl x + n0^. (34) 

Note that (l32l) and (l34l have the same sensing matrix and the same noise statistics (n is a white circularly 
symmetric Gaussian and hence is invariant to multiplication by ip\ .). The only difference is that y is 
modulated by the vector in moving from the first to the i th cluster. This allows us to write 

p{Z i \y)=p{Z 1 \yOil>* Ai ) and E[x|y, Z-\ = E[x|y Z x \ (35) 

which is valid for both the Gaussian and non-Gaussian cases. In other words, if Zi is obtained from Z\ 
by a constant shift, then any y-independent calculations remain the same while any calculations involving 
y are obtained by modulating y by the vector as shown in Figure [3] For example, the likelihood in 
the Gaussian case reads 

exp (-||y|||,-i ) exp ( -||y © V'aJIIj-i 



14 For example, if we use the last M rows of the DFT matrix to construct the sensing matrix, then if} l 



exp I 



]2tt(N-M) 
N 



Ai ) exp pW^^ A,) ... ex p(-^i) Al )] T . 
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and, in the non-Gaussian case, it reads 

p(y\Zi) ~ exp (-||y||px ) = exp (-||y V> Ai HpiJ ■ ( 37 ) 
We observe similar behavior in calculating the expectation. Thus, in the Gaussian case, we have 

E[x|y,^] = ol*\ V-]y = a 2 x * H Zi S^y©^) (38) 
and in the non-Gaussian case, we have 

E[x| y ,z,] = (^.^j^^y^^^j^Kyeft,)' 09) 

2) Toeplitz/Hankel Matrices: In the Toeplitz or block Toeplitz case, the sensing matrix reads * = 
[V&Sj *&s 2 • • • ^5 C ]- Now, the clusters can be modified to make sure that they are identical (by stretching 
their end points if necessary) such that tyg. = [ O • • • O @ T O • • • O ] T - m other words, the 
s are simply shifted versions of each other. We now calculate the quantities de^E^J, llyllv 1 ' an( ^ 

1 1 2 

| y| p-l for a set Z\ of columns of the first cluster. We then choose an identical set of columns, Z^, in 
the i th cluster. Then, it is intuitively clear that 

det(E z J =det(E 2l ), ||y|||;-i = ||y Wi|||,-i , and ||y|| Pi = ||y w^L (40) 

where Wj is a rectangular window corresponding to the location of the non-zero rows of 

B. Order within a cluster 

To evaluate the likelihood for supports of size i = 1, 2, P c in a single cluster, we pursue an order- 
recursive approach, calculating the likelihood and expectation for supports of size i + 1 by updating 
calculations made for supports of size i. In the following, we assume that we have calculated the likelihood 
and expectation involving the columns, which we would like to update to ^s> = [^s 

I) x|<S is Gaussian: To calculate the likelihood Cs> = d e t(E — j S ' w ^ = ^-M+^S-^S'^S'' 

note that S5/ = ~E$ + ^fi/>j'0i\ or by the matrix inversion lemma, 

2 

E" 1 = E- 1 -^^ (41) 
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where 



A 



1 + 

^ IT) 



1 + 



(7 



As we are actually interested in computing exp (— ^-||y|||,_i), using (|4TT ) we obtain 



exp — 2 y 



i ^ II Il2 , °a;£i m H ii 2 

«p l-^lly|l E - + -^-Ky|l 



exp 



1 



exp 



K H y|| 2 



The determinant of £s< can be evaluated as follows: 

v 2 \ / ^2 



Thus, the likelihood for the support of size S' can be written as (using (|44l) and (@5]l), 



exp i -^l|y|||;-i) ex P (^fll^yll 2 



det(E 5 )e. 



-i 



Cs ^exp(^K H y|| 2 



(42) 
(43) 



(44) 



det(£sO = det (v s + ^i^f) = det (l + ^f^EjVi) det(S 5 ) = ^ det(E 5 ) . (45) 



(46) 



This shows that to calculate C$>, we need to compute only u>j and £j, which constitute £j. To calculate 
uji for a cluster of length, L, 0(LM 2 ) operations is required if standard matrix multiplication is used. 
This complexity can be reduced to O(LM) by storing all the past computed values of u and £ and using 
the structure of S 5 11231 . 

Similarly, E[x$/|y] can be calculated in an order-recursive manner as follows: 



![xs'|y] 



E[x5|y] 



(47) 



2) x\S is unknown: To calculate the likelihood in the non-Gaussian case, we need to evaluate the 



norm, ||y|| p j 



y H \l/S' (^5, ^5') ^^y- Our approach mainly hinges on calculating the 
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inverse A$> 



recursively. We do this by invoking the block inversion formula 



A s + j-Uiu" 



(48) 
(49) 



where 



& = nv i |i 2 -(v l H *5)A5(*^) = ii^ 



(50) 
(51) 



with the elements of 77^ = S&sVj ai l available (i.e., they are calculated initially and can be reused 
afterwards). Using this recursion, we can construct (following some straightforward manipulation) a 
recursion for the projected norm C$''- 



cxp ( — \ f 

^ \ --72 r 



n 



|y|| 2 -y H * 5 'A 5 ,*^y 



|yf - y H * 5 A 5 *5y 



cxp I — \ 

at 



-i|(y H *5)^| 2 + |"Re{( y H ^K H (lJr H y)} _ i_| y H^. |2 

si si si 



£ 5 exp ( _ Ky^s^l 2 - 2Re{(y H ^)^ H (*H y)} + | y H^ ,2 



(52) 



Similarly, we can show that 

wl . / t h \ [ E[x 5 |y] + i^E^ly] - Jwi^y 

E[x 5 /|y] = A^/(*g,y) = ^ 4 * (53) 

-i^i[x5|y] + ^ H y 

The cluster independent and cluster-wise evaluations in our recursive procedure for both the cases (x|<S 
Gaussian or unknown) are summarized in Table |n] 



VII. Simulation Results 

In this section, we compare the performance of the OC algorithm with popular sparse reconstruction 
methods available in the literature including the convex relaxation (CR) method lfl2l . OMP lfl"5l . and 
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TABLE II: Cluster independent and cluster-wise evaluations involved in the recursive procedure for 
complexity reduction within a cluster 





Cluster Independent Evaluations 


Cluster-wise Evaluations 


x|<S is Gaussian 


Evaluate u>i and £j using d42li and (03]) 
Update Eg 1 using (HTt 
Update det(Ss) using (|45]i 


Evaluate ||a;^y|| 2 

Update Cs using equation d46b 

Update E[xs'|y] using equation fflfl 


x|<S is unknown 


n 

Initialize: Calculate tp i tpj V i,j 
Evaluate a;, using equation (l5Qb 
Update Eg using equations d48l) and (TSTI) 


Initialize: Evaluate y H i/' i V i 
Update £5 using equation d52b 
Update E[xs'|y] using equation (l53l 



FBMP ll23l . The parameters of these algorithms are set according to the specifications provided by the 
authors to achieve the best resultscj The parameters that we use in all the simulations are N = 800, 
M = f = 200, p = 10~ 2 , and SNR = 30dB (unless stated otherwise). Specifically, we demonstrate the 
performance of our algorithm for the case when the sensing matrix is a DFT or a Toeplitz matrix. We 
start by first investigating the effect of cluster length on the performance of OC. 



A. The effect of the cluster length, L 

Figure 0] compares the normalized mean-square error (NMSE) of OC as the cluster length, L, is varied. 

1 R ||x< r >-X< r > II 2 

The NMSE is defined as NMSE = J2 r =i iix^lp ' wnere ^ stan ds for the estimated sparse signal 
for realization r, and R is the total number of runs. For this case, the DFT matrix is used as the sensing 
matrix with x|<S Gaussian. Note that while implementing OC with fixed-length clusters, overlapping of 
clusters is not allowed to maintain orthogonality. This results in an increase in the probability of missing 
the correct support if two supports are close to each other. Thus, the smaller the cluster, the greater the 
probability of missing the correct supports. This is evident from Figure [4] as performance of OC improves 
by increasing L. Obviously, this improvement in performance is obtained at the expense of speed. Figure 
[5] shows that the smaller the length of clusters, the faster the algorithm. Note that for larger values of 
L (e.g., L > 32), it might not be possible to form the required number of non-overlapping clusters. To 
overcome this problem, we present the performance of OC implemented with variable length clusters 
(labeled as "OC" in Figure 0]). In this case, the overlapping clusters are joined together to form larger 
clusters. It can be observed from Figure 0] that the performance of OC with variable-length clusters is 

15 For a fair comparison, we perform the MMSE refinement on the output of CR and OMR 
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better than the case when it is implemented with fixed-length clusters. Moreover, this performance is 
achieved with a reasonable run-timd_j as shown in Figure [5] 

B. The effect of the signal-to-noise ratio (SNR) 

Figure [6] compares the performance of the algorithms for the case when the sensing matrix is a DFT 
matrix and x|<S is Gaussian. In the FBMP implementation, the number of greedy branches to explore 
(D) is set to 10. Note that OC outperforms all other algorithms at low SNR while FBMP performs 
quite close to it at SNR > 25 dB. It outperforms both OMP and CR at all SNR values. Specifically, 
at SNR = 25 dB, OC has a gain of approximately 2 dB and 3 dB over CR and OMP, respectively. 
The performance of the algorithms for the case when the sensing matrix is a DFT matrix and x|«S is 
unknown is presented in Figure [7] In this case, the entries of are drawn from a uniform distribution. 
Here, FBMP is allowed to estimate the hyper-parameters using its approximate ML algorithm (with E 
set to 10)[23|. It can be seen that OC easily outperforms OMP and FBMP while CR performs similar to 
OC. Specifically, at SNR = 25 dB, OC outperforms OMP and FBMP by approximately 5 dB. Figure [8] 
compares the performance of the algorithms for the case when the sensing matrix is Toeplitz. To do so, 
we first generate a Toeplitz matrix from a column having 20 non-zero consecutive samples drawn from 
a Gaussian distribution. The sensing matrix is then extracted by uniformly sub-sampling this full matrix 
at a rate less than the duration of the signalO Note that the performance of OC and FBMP is almost 
the same at low SNR but OC outperforms FBMP in the high SNR region. OMP and CR do not perform 
well in this case as the sensing matrix does not exhibit the requisite incoherence conditions (in this case, 
~ 0.9) on which much of the CS theory is based. 

C. The effect of the under-sampling ratio (|y) 

Figure [9] shows the performance of the algorithms (for the case when the sensing matrix is DFT and 

N_- 

M ■ 



x|<S is Gaussian) when the under-sampling ratio (jj) is varied. It can be observed that the performance 



of all the algorithms deteriorates as |4 increases. OC and FBMP perform quite close to each other with 
OC performing slightly better at high (-p) ratios. 



16 Thus, the following simulation results are presented with OC implemented using variable length clusters. 
17 In this case, the sub-sampling rate is 4 times less making M — 200. 
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D. The effect of the sparsity rate, p 

Figure [10] compares the performance of the algorithms when the sparsity rate, p, is varied (for the case 
when the sensing matrix is DFT and x|<S is Gaussian). It can be seen that the performance of OC is 
quite close to CR and FBMP at low sparsity rate while it outperforms OMP by approximately 3 dB for 
the studied range of p. The performance of OC deteriorates at the high sparsity rate because the number 
of clusters increases as p increases and the probability of clusters to be near or overlapping each other 
increases. Thus, in this case, the orthogonality assumption of OC becomes weak. Figure [TT] compares 
the mean run-time of all the algorithms. It can be seen that OC is faster than all other algorithms. As 
sparsity rate increases, the length of the clusters increases, and thus the complexity of OC. Figure [12] 
shows that OC performs quite well at the low sparsity rate in the case when the sensing matrix is DFT 
and x|<S is unknown. FBMP does not perform well at the low sparsity rate in this case even with its 
approximate ML algorithm. The run-time of FBMP is also higher as compared to Figure [10] due to the 
time taken to estimate the hyper-parameters using the ML algorithm. In the case of the Toeplitz matrix 
(see Figure [141 ). the performance of OC and FBMP is almost the same while the performance of CR 
and OMP is quite poor due to the weak incoherence of the sensing matrix. It can also be observed from 
Figure [15] that OC is quite fast compared to the other algorithms. 

VIII. Conclusion and Future Work 

In this paper, we present the Orthogonal Clustering algorithm for fast Bayesian sparse reconstruction. 
This algorithm makes collective use of the underlying structure (sparsity, a priori statistical information, 
structure of the sensing matrix) to achieve superior performance at much lower complexity compared with 
other algorithms especially at low sparsity rates. The proposed algorithm has the following distinctive 
features. 

1) It is able to deal with Gaussian priors as well as with priors that are non-Gaussian or unknown. 

2) It utilizes the structure of the sensing matrix, including orthogonality, modularity, and order- 
recursive calculations. 

3) In the Gaussian case, OC beats all other algorithms in terms of complexity and performance for low 
sparsity rates. In the non-Gaussian case, it outperforms all other algorithms (most notably FBMP) 
for both low and high sparsity rates. Hence, the only disadvantage of OC is its performance at high 
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sparsity rates. In this case, the clusters are no longer orthogonal, which results in large clusters and 
the orthogonality assumption becomes invalid. Fortunately, this drawback is only observed in the 
Gaussian case while in the non-Gaussian case, OC maintains a relative advantage over the other 
algorithms for all sparsity rates. 
4) It is able to provide computable measures of performance (See @ for details on how to calculate 
the error covariance matrix using orthogonality). 
Our future work includes 

1) The OC algorithm assumes that various clusters do not interact. We guarantee this by lumping any 
two clusters that are too close into a single larger cluster. This prevents us from implementing a 
fixed-size cluster algorithm and gives our algorithm the advantage of being computationally cleaner 
and more efficient. A prerequisite to do so however is to implement an OC that takes into account 
the interaction between neighboring clusters. 

2) The OC algorithm utilizes various levels of structure in the sensing matrix but falls short of 
utilizing one additional structure. Specifically, the various columns of any cluster are not random 
but are actually related (e.g., adjacent columns in the Toeplitz case exhibit a shift structure)!^ This 
additional structure can be used to reduce further the complexity of our algorithm. 

3) The OC algorithm does not use any dependence between the active sparse elements (e.g., block 
sparsity). It can be specialized to deal with such cases. 

4) The divide-and-conquer approach that we are able to pursue due to the structure of the sensing 
matrix can be utilized in the existing algorithms like OMR 
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Fig. 1 : The 500*' 1 column has high correlation with its neighbors 
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T 



Find E[x|y,cS] for dominant support of each size 



Evaluate x M MSE or x M AP 



det(S^) AS-* 
(or Pi) — 



Modulator 



Calculate 
p(Zi\y) andE[x|y,Zi; 



Fig. 3: Block diagram of the reduced complexity algo- 
rithm for the DFT matrix 



Fig. 2: Flowchart of the OC algorithm 
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Fig. 8: NMSE vs SNR for the Toeplitz matrix and x|SFig. 9: NMSE vs the undersampling ratio (^) for the 
Gaussian. DFT matrix and x|«S Gaussian. 
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Fig. 10: NMSE vs p for the DFT matrix and x|SFig. 11: Mean run-time for the DFT matrix and x|<S 
Gaussian. Gaussian. 
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Fig. 12: NMSE vs p for the DFT matrix and x|SFig. 13: Mean run-time for the DFT matrix and x|<S 



unknown. 



unknown. 



a ♦ * 




0.005 0.01 0.015 0.02 0.025 

P 



i 10° 



A. ... A A A A A A 



e - o 



_o- -o- 



- o 



■ V CR [12] 
-O- FBMP [25] 
•★-OMP [17] 
-B— OC 



0.005 0.01 0.015 0.02 0.025 

P 



Fig. 14: NMSE vs p for the Toeplitz matrix and x|<SFig. 15: Mean run-time for the Toeplitz matrix and x|<S 
Gaussian. Gaussian. 



